Distributed High Dimensional Information Theoretical 
Image Registration via Random Projections^ 



Zoltan Szabo*, Andras Lorincz 

Eotvos Lordnd University, Department of Software Technology and Methodology 
Pdzmdny Peter setdny 1/C, Budapest, H-1117, Hungary 



Abstract 

Information theoretical measures, such as entropy, mutual information, and 
various divergences, exhibit robust characteristics in image registration ap- 
plications. However, the estimation of these quantities is computationally 
intensive in high dimensions. On the other hand, consistent estimation from 
pairwise distances of the sample points is possible, which suits random projec- 
tion (RP) based low dimensional embeddings. We adapt the RP technique to 
this task by means of a simple ensemble method. To the best of our knowl- 
edge, this is the first distributed, RP based information theoretical image 
registration approach. The efficiency of the method is demonstrated through 
numerical examples. 

Keywords: random projection, information theoretical image registration, 
high dimensional features, distributed solution 



1. Introduction 

Machine learning methods are notoriously limited by the high dimensional 
nature of the data. This problem may be alleviated via the random projection 
(RP) technique, which has been successfully applied, e.g., in the fields of 
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classification [H, 0, 0] , clustering independent subspace analysis jij], search 
for approximate nearest neighbors 0] , dimension estimation of manifolds 
estimation of geodesic paths {§J, learning mixture of Gaussian models 

data stream computation [ill . [12[ 

We note 



compression of image and text data [10 



14 



and reservoir computing [13J. For a recent RP review, see 
that the RP technique is closely related to the signal processing method of 
compressed sensing jl5| . 

As it has been shown recently in a number of works 16|, |17|, ll8|, ll9) , the RP 
approach has potentials in patch classification and image registration. For 
example, 16( combines the votes of random binary feature groups (ferns) for 
the classification of random patches in a naive Bayes framework. Promising 
registration methods using and Lo (Euclidean distance, correlation) norms 
have been introduced in [17| and |l8|, [l9| , respectively. 

Information theoretical cost functions, however, exhibit more robust prop- 
erties in multi- modal image registration 20J, |2l|, |22] . Papers 20, |21( apply 
k-nearest neighbor based estimation. However, the computation of these 
quantities is costly in high dimensions 23( and the different image properties 
(e.g., colors, intensities of neighborhood pixels, gradient information, output 
of spatial filters, texture descriptors) may easily lead to high dimensional 
representation. The task is formulated as the estimation of discrete mutual 
information in 22] and the solution is accomplished by equidistant sampling 
of D points from randomly positioned straight lines. The method estimates 
a histogram of N 2D bins, where N is the number of bins of the image, which 
may considerably limit computational efficiency 

Here we address the problem of information theoretical image registra- 
tion in case of high dimensional features. Particularly, we demonstrate that 
Shannon's multidimensional differential entropy can be efficiently estimated 
for high dimensional image registration purposes through RP methods. Our 
solution enables distributed evaluation. The presented approach extends the 
method presented in ^ in the context of independent subspace analysis (ISA) 



24J , where we exploited the fact that ISA can be formulated as the optimiza- 



tion problem of the sum of entropies under certain conditions 25j. Here, to 



our best knowledge, we present the first distributed RP based information 
theoretical image registration approach. 

The paper is structured as follows: In Section [2] we shortly review the im- 
age registration problem as well as the method of random projections. Sec- 
tion[3]formulates our RP based solution idea for image registration. SectionH] 
contains the numerical illustrations. Conclusions are drawn in Section 
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2. Background 

First, we describe the image registration task (Section 12. 1 p followed by 
low distortion embeddings and random projections (Section 12. 2p . 

2.1. The Image Registration Problem 

In image registration one has two images, I ref and I test , as well as a family 
of geometrical transformations, such as scaling, translation, affine transfor- 
mations, and warping. We assume that the transformations can be described 
by some parameter and let © denote the set of the possible parameters. 
Let transformation with parameter 8 G © on I test produce I test (0). The goal 
of image registration is to find the transformation (parameter 0) for which 
the warped test image I test (0) is the 'closest' possible to reference image I ref . 
Formally, the task is 

J(0) = 5(I ref ,I test (0))^max, 

where the similarity of two images is given by the similarity measure S. Reg- 
istration depends on the similarity measure and one use - among other things 
- L 2 and L\ norm, or different information theoretical similarity measures. 

Let feature f(p',T) G K D denote the feature of image I associated with 
pixel pel. In the simplest case, the feature is the pixel itself, but one 
can choose a neighborhood of the pixel, edge information at and around the 
pixel, the RGB values for colored images, or combinations of these. For 
registrations based on the L q norm (q G {1, 2}), the cost function takes the 
form 

JM = -J2\\f(p^) ~ f(p-d tC st(0))\\ q , 

p 

where for a vector v g R D ||v|L 

= (52iLi \ v i\ q )^ q - Instead of the similarity of 
features in || - 1] ^ and ||-|| 2 norms, one might consider similarity by means of 
information theoretical concepts. An example is that we take the negative 
value of the joint entropy of the features of images I ref , I test (0) as our cost 
function [26]: 

J H {0) = -tf(I ref ,I test (0)), (1) 



where H denotes Shannon's multidimensional differential entropy [27J. One 
may replace entropy H in ([T]) by other quantities, e.g., by the Renyi's a- 
entropy, the a-mutual information, and the a-divergence, to mention some 
of the candidate similarity measures 2^ . 
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2.2. Low Distortion Embeddings, Random Projection 

Low distortion embedding and random projections are relevant for our 
purposes. Low distortion embedding intends to map a set of points of a 
high dimensional Euclidean space to a much lower dimensional one by pre- 
serving the distances between the points approximately. Such low dimen- 
sional approximate isometric embedding exists according to the Johnson- 
Lindenstrauss Lemma (28| : 

Lemma (Johnson-Lindenstrauss). Given a number e £ (0, 1) and a point set 
{v l ,...,v T } C M, D of T elements. Then for d = 0(\n(T)/e 2 ) there exists a 
Lipschitz mapping f : R D R d such that 

(1 - e) \\v t - % || 2 < \\f( Vi ) - /( % )|| 2 < (1 + e) \\v t - Vj \\ 2 (2) 

for any 1 < i, j < T. 

During the years, a number of explicit constructions have appeared for 
the construction of /. Notably, one can show that the property embraced by 
(J2J) is satisfied with probability that approaches 1 for random linear mapping 
(f(y) = Pv, P G R dxD ) provided that P is chosen to project to a random 
c?-dimensional subspace |29j | . 

Less strict conditions on P are also sufficient and many of them decreases 
computational costs. Introducing the notation P = ^R, i.e., /(v) = ^Rv, 

R g M dxD , it is sufficient that matrix elements of matrix R = [r^jare 
drawn independently from the N(0, 1) standard normal distribution (3o|F1 
Other explicit constructions for R include Rademacher and (very) sparse 



distributions 3l|, |32[. More general methods are also available based on 



weak moment constraints 33, 34 . 



3. Method 

In image registration information theoretical registration measures show 
robust characteristics when compared with Li and L 2 measures^], e.g., in 21 



directly for (0Q), and for a-entropy, a- mutual information, and a-divergence in 



20(. However, these estimations have high computational burdens since the 



1 Multiplier in expression P = -^jB means that the length of the rows of matrix P 
is not strictly one; it is sufficient if their lengths are 1 on the average. 

2 The family of L 2 measures include the correlation defined by the scalar product. 
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dimension of the features in the cited references are 25 and 64 respectively. 
Here, we deal with the efficient estimation of cost function (prj that from now 
on we denote by J. We note that the idea of efficient estimation can be used 
for a number of information theoretical quantities, provided that they can 
be estimated by means of pairwise Euclidean distances of the samples. 
Central to our RP based distributed method are the following: 

1. The computational load can be decreased by 
(a) dividing the samples into groups and then 



(b) computing the averages of the group estimates [2l| . 
We call this the ensemble approach. 
2. Estimation of the multidimensional entropy cost function J can be 
carried out consistently by nearest neighbor methods using pairwise 



Euclidean distances of sample points |35|, |36|, [37 



Taking into account that low dimensional approximate isometric embed- 
ding of points of high dimensional Euclidean space can be addressed by the 
Johnson-Lindenstrauss Lemma and the related random projection methods, 
we suggest the following procedure for distributed RP based entropy (and 
thus J) estimation: 

1. divide the T feature samples^ {v(l), . . . , v(T)} C 1R D into iV groups 
indexed by sets 7i, . . . , Ijy so that each group contains G samples, 

2. for all fixed groups take the random projection of v as 



vf (t) := R n v(t) (t e I n ; n = 1, . . . , N; R n e WL dxD ), 



Note: normalization factor ^ can be dropped in P = ^R since it be- 
comes an additive constant term for the case of the differential entropy, 
H(cv) = H( Y ) + d\og(\c\). 
3. average the estimated entropies of the RP-ed groups to get the estima- 
tion 

n=l 

In the next section we illustrate the efficiency of the proposed RP based 
approach in image registration. 



3 In the image registration task the set of feature samples is {[f(p, I re f); /(pTtest(^))]} 
where p is the running index and [a; b] denotes the concatenation of vectors a and b. 



5 



4. Illustrations 



In our illustrations, we show examples that enable quantitative evaluation 
and reproduction: 

1. We use 512 x 512 images: 

(a) In test Lena, we register the rotated versions of the red and green 
channels of image Lena, see Fig. QJa). 

(b) In the mandrill test we register the rotated versions of the gray- 
scale image of a mandrill baboon and its Sobel filtered version, 
see Fig. [0(b) . 

2. we chose to evaluate the objective function for angles from —10° to 
10° by steps 0.5° and in interval [—1°, 1°] by steps 0.1°. In the ideal case 
the optimal degree 6* is 0. Our performance measure is the deviation 
from the optimal value. 

In our simulations, 

• we chose the D = (2h + 1) x (2h + 1) rectangle around each pixel as 
the feature / of that pixel. 

• coordinates of the R n RP matrices were drawn independently from 
standard normal distribution, but more general constructions could 



also be used 33, 34 



for each individual parameter, 10 random runs were averaged. Our 
parameters included h, the linear size of the neighborhood that deter- 
mines dimension D of the feature, G, the size of the randomly projected 
groups and d, the dimension of RP. 

performance statistics are summarized by means of notched boxed 
plots, which show the quartiles {Qi,Q2,Qs), depict the outliers, i.e., 
those that fall outside of interval [Qi — 1.5(Q 3 — Qi), Q 3 + 1.5(Q 3 — Qi)] 
by circles, and whiskers represent the largest and smallest non-outlier 
data points. 

we studied the efficiency of five different entropy estimating methods 
in (j3J) including 



— the recursive k-d partitioning scheme 38 

— the fc-nearest neighbor method 371 ]. 
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— generalized k-nearest neighbor graphs 39], 

— minimum spanning tree based entropy graphs 40|, |36) and 

— the weighted nearest neighbor method (4l| . 

The methods will be referred to as kdp, kNNk, kNNi_k, MST and 
wkNN, respectively, kdp is a plug-in type method estimating the un- 
derlying densit y di rectly, hence especially efficient for small dimensional 



(d) problems. (39J extends the approach of [37j (S = {k}) to an arbi- 
trary S neighborhood subset (S C {1, . . . , k}). In our experiments, we 
set S = {1, . . . , k}. Instead of k-nearest graphs the total sum of pair- 
wise distances is minimized over spanning trees in the MST method. 
The kNNk, kNN^k, MST constructions belong to the general umbrella 
of quasi-additive functionals jio| providing statistically consistent esti- 
mation for the Renyi entropy (H a ) [42| . The Shannon's entropy H is 
a special case of this family since lime,-;.! H a = H. In our simulations, 
we chose a = 0.95. k, the number of neighbors in kNN^, kNNi_^ was 
5. Finally, the wkNN technique makes use of a weighted combination 
of k-nearest neighbor estimators for different k values. 

• h, the neighborhood parameter was selected from the set {1, 2, 3, 4, 5, 10, 20, 30}. 

• G, the size of groups and d, the RP dimension took values 20, 50, 100, 
1, 000, 10, 000, 100, 000 and 1, 2, 5, 8, respectively. 

• the T feature points were distributed randomly into groups of size G 
in order to increase the diversity of the individual groups. 

In the first set of experiments we focused on the precision of the 
estimations on the Lena dataset. According to our experiences 

• there is no relevant /visible difference in the precision of the estimations 
for h G {1,2,3,4,5,10}. The estimation is even of high precision for 
h = 10 that we illustrate in Fig. [2](a)-(b) for the kdp technique. The 
estimation errors are quite similar for h = 20 and 30, the latter is 
shown in Fig. [^c)-(d). Here, one can notice a small uncertainty in the 
estimations for smaller RP dimensions (d = 1,2), which is moderately 
present for larger d values (d — 5,8) - except for the largest studied 
group size G = 100, 000. 
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• the obtained results are of similar precision on this dataset for all the 
studied entropy estimators. We illustrate this property for the most 
challenging h = 30 value in Fig. [3] and Fig. HI 

In the second set of experiments we were dealing with the mandrill 
dataset, where different modalities of the same image (pixel, edge filtered 
version) had to be registered. Here, 

• the kdp approach gradually deteriorates as the dimension of the under- 
lying feature representation is increasing, i.e., as a function of h. For 
h — 1, the method gives precise estimations for d — 1,2 and small group 
sizes (G = 20,50, 100); other parameter choices result in uncertain es- 
timations, see Fig. E^a)-(b). By increasing the size of neighborhood 
(h), the estimations gradually break down. For h = 5, the precisions 
are depicted in Fig. E)^c)-(d); the estimations are still acceptable. For 
h = 10 we did not obtain valuable estimations for the kdp technique. 

• in contrast to the kdp method, the kNN^, kNNi_^, MST and wkNN 
techniques are all capable of coping with the h = 5 and h = 10 values, 
as it is illustrated in Fig. [6j Fig. [7J Fig. [8] and Fig. [9j respectively. It 
can also be observed, that the RP dimension must be d > 2 here, and 
in case of d = 5, 8 one obtaines highly precise/certain estimations. 

• the only method which could cope with the increased h = 20 neighbor 
size value, was the wkNN technique. This result could be achieved for 
RP dimension d = 8 making use of small group sizes (G = 20, 50, 100), 
see Fig. dUJ 

The computation times are illustrated for the Lena {h = 30) and man- 
drill dataset {h = 5) for the kdp method in Fig JTlT a) and Fig JTlT b). respec- 
tively. As it can be seen, the ensemble approach with group size G = 20 — 100 
may speed up computations by several orders of magnitudes; similar trends 
can be obtained for the other estimators, too. Among the studied methods, 
the kdp technique was the most competitive in terms of computation time. 
We also present the computation times for the largest studied problem, Lena 
with h = 30; compared to kdp 

• the kNNk and kNNi_k techniques were within a factor of 1.5 — 2.5 in 
terms of computation time, 
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• the wkNN method was 1.5 — 2 (3 — 4) times slower compared to the 
kdp approach in case of G < 1, 000 (G = 10, 000), and 

• the MST based estimator was within a factor of 1.5 — 2 compared to 
kdp in case of G < 100, and more than 6 times slower for G = 1, 000. 

As it can be seen in Fig. [Til the application of the reduced RP dimension 
can be advantageous in terms of computation time. Moreover, compared to 
schemes without dimensionality reduction (d = D and R„ = I, Vn), i.e., work- 
ing directly on raw data, the presented RP based dimensionality approach 
can heavily speed-up computations. This behaviour is already present for 
h — 5, as it is illustrated for d = 2 on the mandrill dataset in Table [TJ 

Considering the possible (h, d, G) choices, according to our numerical 
experiences, 

• often, small d = 2 — 5 RP dimensions give rise to reliable estimations 
for several entropy methods, 

• it is necessary to slowly increase d as a function of the dimension of the 
feature representation (parameterized by h), 

• in the studied parameter domain, group sizes of G = 50 — 100 could pro- 
vide precise estimations, and simultaneously open the door to massive 
speed-up by distributed solutions. 

These results demonstrate the efficiency of our RP based approach. 
5. Conclusions 

We have shown that the random projection (RP) technique can be adapted 
to distributed information theoretical image registration. Our extensive nu- 
merical experiments including five different entropy estimators demonstrated 
that the proposed approach (i) can offer orders of magnitude in computation 
time, and (ii) provides robust estimation for large dimensional features. 

It is very promising since it is parallel and fits multi-core architectures, 
including graphical processors. Since information theoretical measures are 
robust, our method may be useful in diverse signal processing areas with the 
advance of multi-core hardware. 
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G = 20 


G = 50 


G = 100 


G = 1000 


kdp 


1.01 


1.27 


1.29 


2.92 


kNN!_ fe (kNN fc ) 


2.25 


5.71 


12.65 


125.84 


MST 


1.15 


1.36 


1.65 


1.82 


wkNN 


1.30 


2.59 


4.88 


16.05 



Table 1: Computation time versus raw data based method. Value z > 1, 
means z times improvement in computation time over the method not ap- 
plying dimension reduction. Dataset: mandrill. Baseline: RP dimension 
d = 2. Neighbor size: h = 5. 
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Figure 2: Estimation error as a function of the RP dimension d on the Lena 
dataset for different G group sizes. Method: kdp. (a)-(b): h = 10. (c)-(d): 
h = 30. First column: d — 1, 2. Second column: d — 5, 8. 
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Figure 3: Estimation error as a function of the RP dimension d on the Lena 
dataset for different G group sizes. Neighbor size: h = 30. (a)-(b): kNN k 
method, (c)-(d): kNNi_ k method. First column: d — 1, 2. Second column: 
d= 5, 8. 
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Figure 4: Estimation error as a function of the RP dimension d on the Lena 
dataset for different G group sizes. Neighbor size: h = 30. (a)-(b): MST 
method, (c)-(d): wkNN method. First column: d — 1, 2. Second column: 
d = 5, 8. 
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Figure 5: Estimation error as a function of the RP dimension d on the 
mandrill dataset for different G group sizes. Method: kdp. (a)-(b): neighbor 
size h — 1. (c)-(d): h = 5. First column: d — 1, 2. Second column: d — 5, 8. 
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Figure 6: Estimation error as a function of the RP dimension d on the man- 
drill dataset for different G group sizes. Method: kNN k . (a)-(b): neighbor 
size h = 5. (c)-(d): h = 10. First column: d = 1, 2. Second column: d = 5, 
8. 
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Figure 7: Estimation error as a function of the RP dimension d on the 
mandrill dataset for different G group sizes. Method: kNNi_ k . (a)-(b): 
neighbor size h = 5. (c)-(d): h = 10. First column: d = 1, 2. Second 
column: = 5, 8. 
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Figure 8: Estimation error as a function of the RP dimension d on the 
mandrill dataset for different G group sizes. Method: MST. (a)-(b): neighbor 
size h = 5. (c)-(d): h = 10. First column: d = 1, 2. Second column: d = 5, 
8. 
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Figure 9: Estimation error as a function of the RP dimension d on the man- 
drill dataset for different G group sizes. Method: wkNN. (a)-(b): neighbor 
size h = 5. (c)-(d): h = 10. First column: d = 1, 2. Second column: d = 5, 
8. 
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Figure 10: Estimation error for RP dimension d — 8 and for different G 
group sizes. Method: wkNN. Neighbor size: h = 20. 
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Figure 11: Computation time as a function of the RP dimension for different 
G group sizes with log scale on the y axis. Method: kdp. (a): Lena dataset, 
neighbor size h = 30. (b): mandrill dataset, h = 5. 
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